Red Wine Data Analysis by Josias Marcos Orlando

Requirements:

Libraries:

  • ggplot2
  • GGally
  • scales
  • memisc
  • gridExtra

Dataset

The Red Wine Quality data set contains 1,599 red wines with 11 variables on the chemical properties of the wine. At least 3 wine experts rated the quality of each wine, providing a rating between 0 (very bad) and 10 (very excellent).

Input variables (based on physicochemical tests):

  • fixed acidity (tartaric acid - g / dm^3)
  • volatile acidity (acetic acid - g / dm^3)
  • citric acid (g / dm^3)
  • residual sugar (g / dm^3)
  • chlorides (sodium chloride - g / dm^3)
  • free sulfur dioxide (mg / dm^3)
  • total sulfur dioxide (mg / dm^3)
  • density (g / cm^3)
  • pH
  • sulphates (potassium sulphate - g / dm^3)
  • alcohol (% by volume)

Output variable (based on sensory data):

  • quality (score between 0 and 10)

For this analysis, we are mainly looking to answer one question: Which chemical properties influence the quality of red wines?

Univariate Plots Section

To get a better feel for the Red Wine Quality dataset, the first step is to look at a summary of its variables. The minimum and maximum values show how widely each variable is spread, and comparing the mean with the median gives a quick indication of how the data are distributed.

##  [1] "X"                    "fixed.acidity"        "volatile.acidity"    
##  [4] "citric.acid"          "residual.sugar"       "chlorides"           
##  [7] "free.sulfur.dioxide"  "total.sulfur.dioxide" "density"             
## [10] "pH"                   "sulphates"            "alcohol"             
## [13] "quality"
##        X          fixed.acidity   volatile.acidity  citric.acid   
##  Min.   :   1.0   Min.   : 4.60   Min.   :0.1200   Min.   :0.000  
##  1st Qu.: 400.5   1st Qu.: 7.10   1st Qu.:0.3900   1st Qu.:0.090  
##  Median : 800.0   Median : 7.90   Median :0.5200   Median :0.260  
##  Mean   : 800.0   Mean   : 8.32   Mean   :0.5278   Mean   :0.271  
##  3rd Qu.:1199.5   3rd Qu.: 9.20   3rd Qu.:0.6400   3rd Qu.:0.420  
##  Max.   :1599.0   Max.   :15.90   Max.   :1.5800   Max.   :1.000  
##  residual.sugar     chlorides       free.sulfur.dioxide
##  Min.   : 0.900   Min.   :0.01200   Min.   : 1.00      
##  1st Qu.: 1.900   1st Qu.:0.07000   1st Qu.: 7.00      
##  Median : 2.200   Median :0.07900   Median :14.00      
##  Mean   : 2.539   Mean   :0.08747   Mean   :15.87      
##  3rd Qu.: 2.600   3rd Qu.:0.09000   3rd Qu.:21.00      
##  Max.   :15.500   Max.   :0.61100   Max.   :72.00      
##  total.sulfur.dioxide    density             pH          sulphates     
##  Min.   :  6.00       Min.   :0.9901   Min.   :2.740   Min.   :0.3300  
##  1st Qu.: 22.00       1st Qu.:0.9956   1st Qu.:3.210   1st Qu.:0.5500  
##  Median : 38.00       Median :0.9968   Median :3.310   Median :0.6200  
##  Mean   : 46.47       Mean   :0.9967   Mean   :3.311   Mean   :0.6581  
##  3rd Qu.: 62.00       3rd Qu.:0.9978   3rd Qu.:3.400   3rd Qu.:0.7300  
##  Max.   :289.00       Max.   :1.0037   Max.   :4.010   Max.   :2.0000  
##     alcohol         quality     
##  Min.   : 8.40   Min.   :3.000  
##  1st Qu.: 9.50   1st Qu.:5.000  
##  Median :10.20   Median :6.000  
##  Mean   :10.42   Mean   :5.636  
##  3rd Qu.:11.10   3rd Qu.:6.000  
##  Max.   :14.90   Max.   :8.000
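The mean-versus-median comparison can be made concrete with a few of the values printed above. The sketch below is illustrative only: it copies four means and medians from the summary and reports how far each mean sits above its median, relative to the median. A mean well above the median hints at a right-skewed distribution. The 10% cut-off is an arbitrary threshold chosen for this example, not a rule used in the analysis.

```r
# Means and medians copied from the summary() output above.
stats <- data.frame(
  variable = c("residual.sugar", "total.sulfur.dioxide", "pH", "density"),
  mean     = c(2.539, 46.47, 3.311, 0.9967),
  median   = c(2.200, 38.00, 3.310, 0.9968)
)

# Relative gap between mean and median; positive values mean the mean is pulled upward.
stats$rel_gap <- (stats$mean - stats$median) / stats$median

# Flag variables whose mean exceeds the median by more than 10% (arbitrary cut-off).
stats$right_skew_hint <- stats$rel_gap > 0.10

print(stats)
```

With these numbers, residual.sugar and total.sulfur.dioxide are flagged, while pH and density are not.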

Another useful way to inspect the data is to visualize the distribution of each feature. This can be done with histograms, as shown below:
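One possible way to draw a histogram per feature with ggplot2 and gridExtra is sketched here. It is not the code used for this report: the data frame is a small synthetic stand-in generated on the spot, and the bin count and grid layout are assumptions chosen for the example.

```r
library(ggplot2)
library(gridExtra)

# Synthetic stand-in for the wine data (NOT the real measurements).
set.seed(42)
wines <- data.frame(
  alcohol        = rnorm(200, mean = 10.4, sd = 1),
  residual.sugar = rlnorm(200, meanlog = 0.8, sdlog = 0.4),
  pH             = rnorm(200, mean = 3.3, sd = 0.15)
)

# One histogram per column, built in a loop over the column names.
plots <- lapply(names(wines), function(v) {
  ggplot(wines, aes(x = .data[[v]])) +
    geom_histogram(bins = 30) +
    labs(x = v, y = "count")
})

# Arrange the histograms in a grid on a single page.
grid.arrange(grobs = plots, ncol = 2)
```

With the real dataset, the same loop would run over the 11 input variables instead of the three synthetic columns.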

Univariate Analysis

As mentioned above, looking at the distribution of the data gives a quick overview of the dataset and its boundaries.

We can classify the distributions as:

Based on the definitions above and on the summary (mean and median), we can classify the wine properties into one of the distributions mentioned above:

Bivariate Plots Section

Matrix Plot

To get a quick look at the correlation between pairs of features, we can plot a matrix of plots and correlation values. This plot is shown below:

Since the Matrix Plot is hard to read and its correlation values are cut off, I decided to generate a Scatter Plot of the Wine Quality against each feature, with the mean and median drawn on the same plot. I also calculated the correlation between Wine Quality and each feature.

Another comparison I decided to make was a Boxplot of each feature for each Wine Quality score. To draw it, I needed to add a new categorical variable named grade_number to the dataset. This helped me see the variation of the data within each Wine Quality score.
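The scatter plot, boxplot and correlation test described in the two paragraphs above could be produced along these lines. This is a hedged sketch, not the report's own code: the data are synthetic, grade_number is assumed to be simply factor(quality), and the colours and jitter settings are arbitrary. The fun argument of stat_summary requires ggplot2 3.3.0 or later.

```r
library(ggplot2)

# Synthetic stand-in for the wine data (NOT the real measurements).
set.seed(1)
quality <- sample(3:8, 300, replace = TRUE)
wines <- data.frame(
  quality = quality,
  alcohol = 8 + 0.4 * quality + rnorm(300, sd = 1)
)

# Categorical copy of quality (assumed definition), used to group the boxplots.
wines$grade_number <- factor(wines$quality)

# Scatter plot of quality against one feature, with mean (red) and median (blue) lines.
scatter <- ggplot(wines, aes(x = quality, y = alcohol)) +
  geom_jitter(alpha = 0.3, width = 0.2, height = 0) +
  stat_summary(fun = mean, geom = "line", colour = "red") +
  stat_summary(fun = median, geom = "line", colour = "blue")

# One box per quality score.
boxes <- ggplot(wines, aes(x = grade_number, y = alcohol)) +
  geom_boxplot()

# Pearson correlation test between quality and the feature.
result <- cor.test(wines$quality, wines$alcohol)
print(result$estimate)
```

Printing scatter or boxes draws the corresponding plot. Repeating the three steps for each feature would give one set of plots and one test per variable.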

The plots and calculations are shown below:

Variable Analysis: Fixed Acidity

## 
##  Pearson's product-moment correlation
## 
## data:  wines$quality and wines$fixed.acidity
## t = 4.996, df = 1597, p-value = 6.496e-07
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.07548957 0.17202667
## sample estimates:
##       cor 
## 0.1240516

Variable Analysis: Volatile Acidity

## 
##  Pearson's product-moment correlation
## 
## data:  wines$quality and wines$volatile.acidity
## t = -16.954, df = 1597, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.4313210 -0.3482032
## sample estimates:
##        cor 
## -0.3905578

Variable Analysis: Citric Acid

## 
##  Pearson's product-moment correlation
## 
## data:  wines$quality and wines$citric.acid
## t = 9.2875, df = 1597, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.1793415 0.2723711
## sample estimates:
##       cor 
## 0.2263725

Variable Analysis: Residual Sugar

## 
##  Pearson's product-moment correlation
## 
## data:  wines$quality and wines$residual.sugar
## t = 0.5488, df = 1597, p-value = 0.5832
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.03531327  0.06271056
## sample estimates:
##        cor 
## 0.01373164

Variable Analysis: Chlorides

## 
##  Pearson's product-moment correlation
## 
## data:  wines$quality and wines$chlorides
## t = -5.1948, df = 1597, p-value = 2.313e-07
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.17681041 -0.08039344
## sample estimates:
##        cor 
## -0.1289066

Variable Analysis: Free Sulfur Dioxide

## 
##  Pearson's product-moment correlation
## 
## data:  wines$quality and wines$free.sulfur.dioxide
## t = -2.0269, df = 1597, p-value = 0.04283
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.099430290 -0.001638987
## sample estimates:
##         cor 
## -0.05065606

Variable Analysis: Total Sulfur Dioxide

## 
##  Pearson's product-moment correlation
## 
## data:  wines$quality and wines$total.sulfur.dioxide
## t = -7.5271, df = 1597, p-value = 8.622e-14
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.2320162 -0.1373252
## sample estimates:
##        cor 
## -0.1851003

Variable Analysis: Density

## 
##  Pearson's product-moment correlation
## 
## data:  wines$quality and wines$density
## t = -7.0997, df = 1597, p-value = 1.875e-12
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.2220365 -0.1269870
## sample estimates:
##        cor 
## -0.1749192

Variable Analysis: pH

## 
##  Pearson's product-moment correlation
## 
## data:  wines$quality and wines$pH
## t = -2.3109, df = 1597, p-value = 0.02096
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.106451268 -0.008734972
## sample estimates:
##         cor 
## -0.05773139

Variable Analysis: Sulphates

## 
##  Pearson's product-moment correlation
## 
## data:  wines$quality and wines$sulphates
## t = 10.38, df = 1597, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.2049011 0.2967610
## sample estimates:
##       cor 
## 0.2513971

Variable Analysis: Alcohol

## 
##  Pearson's product-moment correlation
## 
## data:  wines$quality and wines$alcohol
## t = 21.639, df = 1597, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.4373540 0.5132081
## sample estimates:
##       cor 
## 0.4761663
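To make these outputs easier to read, the sketch below recomputes the t statistic and the 95 percent confidence interval for alcohol from nothing but the reported correlation and the sample size. It uses the standard formulas for a Pearson correlation test: t = r * sqrt(n - 2) / sqrt(1 - r^2), and a confidence interval built on the Fisher z-transformation. The results should agree with the printed values up to rounding.

```r
# Values copied from the alcohol output above.
r <- 0.4761663   # sample correlation
n <- 1599        # number of wines, so df = n - 2 = 1597

# t statistic for the null hypothesis that the true correlation is 0.
t_stat <- r * sqrt(n - 2) / sqrt(1 - r^2)

# 95% confidence interval via the Fisher z-transformation.
z  <- atanh(r)
se <- 1 / sqrt(n - 3)
ci <- tanh(z + c(-1, 1) * qnorm(0.975) * se)

print(t_stat)
print(ci)
```

The same three lines of arithmetic apply to every other feature: only r changes, since all tests share the same 1,599 observations.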

Bivariate Analysis

The results obtained from the correlation analysis, between each feature and the Quality, were:

Feature               Correlation  Orientation  Strength
fixed.acidity              0.1241  Positive     Very Weak
volatile.acidity          -0.3906  Negative     Weak
citric.acid                0.2264  Positive     Weak
residual.sugar             0.0137  Positive     Very Weak
chlorides                 -0.1289  Negative     Very Weak
free.sulfur.dioxide       -0.0507  Negative     Very Weak
total.sulfur.dioxide      -0.1851  Negative     Very Weak
density                   -0.1749  Negative     Very Weak
pH                        -0.0577  Negative     Very Weak
sulphates                  0.2514  Positive     Weak
alcohol                    0.4762  Positive     Medium

Table Interpretation:

  • Strength (based on the absolute value of the correlation): Very Weak (0 ~ 0.2), Weak (0.21 ~ 0.4), Medium (0.41 ~ 0.6), Strong (0.61 ~ 0.8), Very Strong (0.81 ~ 1.0)
  • Orientation: Positive, Negative
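The classification in the table can be reproduced mechanically. The sketch below takes the correlations reported in the cor.test outputs above and assigns each one a strength band with cut(), applied to the absolute value. The variable names cors, strength and summary_table are introduced here for illustration only.

```r
# Correlations with quality, copied from the cor.test outputs above.
cors <- c(
  fixed.acidity = 0.1240516, volatile.acidity = -0.3905578,
  citric.acid = 0.2263725, residual.sugar = 0.01373164,
  chlorides = -0.1289066, free.sulfur.dioxide = -0.05065606,
  total.sulfur.dioxide = -0.1851003, density = -0.1749192,
  pH = -0.05773139, sulphates = 0.2513971, alcohol = 0.4761663
)

# Strength is assigned from the absolute value, using the table's bands.
strength <- cut(
  abs(cors),
  breaks = c(0, 0.2, 0.4, 0.6, 0.8, 1),
  labels = c("Very Weak", "Weak", "Medium", "Strong", "Very Strong"),
  include.lowest = TRUE
)

summary_table <- data.frame(
  correlation = round(cors, 4),
  orientation = ifelse(cors >= 0, "Positive", "Negative"),
  strength    = strength
)

# Sort by absolute correlation, strongest first.
summary_table <- summary_table[order(-abs(cors)), ]
print(summary_table)
```

Sorting by absolute value puts alcohol, volatile.acidity, sulphates and citric.acid at the top.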

Conclusion:

From the results we can highlight the correlations between Quality and Alcohol, Quality and Volatile Acidity, Quality and Sulphates, and Quality and Citric Acid. The other correlations are very small, which indicates that those features don’t have a big impact on the Wine Quality result.

Multivariate Plots Section

In this section we generate more complex plots by adding color to the points. This adds a new layer and opens the way to analyzing three variables at once, instead of only two as in the previous plots. For this analysis, we’ll use the variables with the largest correlations (in absolute value) with the Wine Quality, which from the table above are alcohol, volatile.acidity, sulphates and citric.acid.
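A three-variable plot of this kind can be sketched with ggplot2 by mapping quality to the colour aesthetic. The example is an assumption-laden illustration rather than the report's code: the data are synthetic values drawn uniformly between the minimum and maximum shown in the summary, and the colour palette is an arbitrary choice.

```r
library(ggplot2)

# Synthetic stand-in for the wine data (NOT the real measurements).
set.seed(7)
wines <- data.frame(
  alcohol          = runif(300, 8.4, 14.9),
  volatile.acidity = runif(300, 0.12, 1.58),
  quality          = sample(3:8, 300, replace = TRUE)
)

# Two features on the axes, quality as a discrete colour scale.
p <- ggplot(wines, aes(x = alcohol, y = volatile.acidity,
                       colour = factor(quality))) +
  geom_point(alpha = 0.7) +
  scale_colour_brewer(palette = "RdYlGn", name = "quality") +
  labs(title = "Alcohol x Volatile Acidity by Quality color")

print(p)
```

Converting quality to a factor gives one distinct colour per score instead of a continuous gradient, which makes the individual quality levels easier to tell apart.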

alcohol X volatile.acidity X quality:

## 
##  Pearson's product-moment correlation
## 
## data:  wines$volatile.acidity and wines$alcohol
## t = -8.2546, df = 1597, p-value = 3.155e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.2488416 -0.1548020
## sample estimates:
##       cor 
## -0.202288

alcohol X sulphates X quality:

## 
##  Pearson's product-moment correlation
## 
## data:  wines$sulphates and wines$alcohol
## t = 3.7568, df = 1597, p-value = 0.0001783
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.04477906 0.14196454
## sample estimates:
##        cor 
## 0.09359475

alcohol X citric.acid X quality:

## 
##  Pearson's product-moment correlation
## 
## data:  wines$citric.acid and wines$alcohol
## t = 4.4188, df = 1597, p-value = 1.059e-05
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.06121189 0.15807276
## sample estimates:
##       cor 
## 0.1099032

volatile.acidity X sulphates X quality:

## 
##  Pearson's product-moment correlation
## 
## data:  wines$sulphates and wines$volatile.acidity
## t = -10.804, df = 1597, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.3060917 -0.2147125
## sample estimates:
##        cor 
## -0.2609867

volatile.acidity X citric.acid X quality:

## 
##  Pearson's product-moment correlation
## 
## data:  wines$citric.acid and wines$volatile.acidity
## t = -26.489, df = 1597, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.5856550 -0.5174902
## sample estimates:
##        cor 
## -0.5524957

sulphates X citric.acid X quality:

## 
##  Pearson's product-moment correlation
## 
## data:  wines$citric.acid and wines$sulphates
## t = 13.159, df = 1597, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.2678558 0.3563278
## sample estimates:
##     cor 
## 0.31277

The correlations between the analyzed variables are:

Correlation       alcohol  volatile.acidity  sulphates  citric.acid
alcohol                 -           -0.2023     0.0936       0.1099
volatile.acidity  -0.2023                 -    -0.2610      -0.5525
sulphates          0.0936           -0.2610          -       0.3128
citric.acid        0.1099           -0.5525     0.3128            -
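A matrix like this one can also be obtained in a single call, instead of running one cor.test per pair. The sketch below shows the idea with cor() on a subset of columns. The data frame is a synthetic stand-in, so the printed numbers will not match the table above; only the shape of the result is comparable.

```r
# Synthetic stand-in for the wine data (NOT the real measurements).
set.seed(3)
wines <- data.frame(
  alcohol          = rnorm(100, 10.4, 1),
  volatile.acidity = rnorm(100, 0.53, 0.18),
  sulphates        = rnorm(100, 0.66, 0.17)
)
# Make citric.acid depend negatively on volatile.acidity, for illustration.
wines$citric.acid <- 0.6 - 0.5 * wines$volatile.acidity + rnorm(100, sd = 0.1)

# Pairwise Pearson correlations between the selected features.
features   <- c("alcohol", "volatile.acidity", "sulphates", "citric.acid")
cor_matrix <- cor(wines[, features], method = "pearson")
print(round(cor_matrix, 4))
```

The result is symmetric with ones on the diagonal, which correspond to the dashes in the table.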

Multivariate Analysis


Final Plots and Summary

Plot One

Description One

This plot showed the negative correlation between the volatile.acidity and the wine quality.

Plot Two

Description Two

This plot showed the correlation between alcohol and the wine quality. It was important for confirming the relationship between this variable and the quality, since alcohol has the largest correlation value.

Plot Three

Description Three

This was the most unexpected result of the analysis. It turns out that the lower the volatile.acidity, the higher both the quality and the citric.acid value tend to be.


Reflection

The hardest part of this analysis was that no single feature, or pair of features, has a clear relationship with the wine quality, which makes it hard to answer the question posed at the beginning of this project:
Which chemical properties influence the quality of red wines?

Even though this was a hard question to answer, it was possible to see that the feature most related to wine quality was alcohol. It has the largest correlation value, and the plots showed that, in general, the higher the alcohol content, the better the wine is rated.

Also, it’s important to say that the variables volatile.acidity, sulphates and citric.acid have some degree of influence on the wine quality.

I never expected the 4 features mentioned above to be the most relevant to the wine quality. I thought pH and residual sugar would be the ones that actually affected the perception of a wine’s quality. To me, this showed how important Exploratory Data Analysis is for getting a clear understanding of the information in a dataset, and that guesses can be completely different from what the data actually contains.